Problem Statement

Premiums paid by customers are the major revenue source for insurance companies. Defaults on premium payments result in significant revenue losses, so insurance companies would like to know upfront which types of customers are likely to default. The objective of this project is to predict the probability that a customer will default on the premium payment, so that the insurance agent can proactively reach out to the policy holder to follow up on the payment.

Data Dictionary

  1. id: Unique customer ID
  2. perc_premium_paid_by_cash_credit: What % of the premium was paid by cash payments?
  3. age_in_days: age of the customer in days
  4. Income: Income of the customer
  5. Marital Status: Married/Unmarried, Married (1), unmarried (0)
  6. Veh_owned: Number of vehicles owned (1-3)
  7. Count_3-6_months_late: Number of times premium was paid 3-6 months late
  8. Count_6-12_months_late: Number of times premium was paid 6-12 months late
  9. Count_more_than_12_months_late: Number of times premium was paid more than 12 months late
  10. Risk_score: Risk score of customer (similar to credit score)
  11. No_of_dep: Number of dependents in the family of the customer (1-4)
  12. Accommodation: Owned (1), Rented (0)
  13. no_of_premiums_paid: Number of premiums paid till date
  14. sourcing_channel: Channel through which customer was sourced
  15. residence_area_type: Residence type of the customer
  16. premium : Total premium amount paid till now
  17. default: Y variable - 0 indicates that customer has defaulted the premium and 1 indicates that customer has not defaulted the premium

Dataset has 79852 rows and 16 columns.

perc_premium_paid_by_cash_credit and risk_score are floats (percentages with a decimal point). sourcing_channel and residence_area_type are objects (non-numerical values). All other columns are integers (numerical values).

No missing data.
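The checks above (shape, column types, missing values) can be sketched with pandas. This is a minimal illustration on a tiny made-up frame, not the project data: the column names come from the data dictionary, while the values (including the "Urban"/"Rural" labels) are assumptions for the example.

```python
import pandas as pd

# Made-up rows; only the column names are taken from the data dictionary.
df = pd.DataFrame({
    "perc_premium_paid_by_cash_credit": [0.17, 0.31, 1.0],
    "Income": [24030, 208850, 90262600],
    "sourcing_channel": ["A", "C", "E"],
    "residence_area_type": ["Urban", "Rural", "Urban"],  # hypothetical labels
})

# Rows and columns.
print(df.shape)

# Data type of each column (float, integer, or text).
print(df.dtypes)

# Number of missing values per column; all zeros means no missing data.
missing_per_column = df.isna().sum()
print(missing_per_column)

# Non-numeric columns are the categorical candidates.
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()
print(categorical_cols)
```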

Dropping ID. No value.

Two categorical columns.
residence_area_type has 2 possible unique values. sourcing_channel has 5 possible unique values.

Change age in days to years. Change column name to age.
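A minimal sketch of the two cleanup steps above (dropping the ID and converting age to years). The sample values are made up, and dividing by 365 with integer division is an assumption; the notes do not say which divisor or rounding was used.

```python
import pandas as pd

# Made-up rows using column names from the data dictionary.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "age_in_days": [7670, 18615, 37602],
    "Income": [24030, 208850, 90262600],
})

# The ID only identifies a row, so it carries no predictive value.
df = df.drop(columns=["id"])

# Convert days to whole years (assumed divisor: 365) and rename to `age`.
df["age"] = df["age_in_days"] // 365
df = df.drop(columns=["age_in_days"])

print(df)
```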

Summary statistics:

  1. perc_premium_paid_by_cash_credit: average 31%, range 0% to 100%
  2. age: average 51 years, range 21 to 103
  3. Income: average 208847, range 24030 to 90262600
  4. Count_3-6_months_late: average 0.24, range 0 to 13
  5. Count_6-12_months_late: average 0.078, range 0 to 17
  6. Count_more_than_12_months_late: average 0.059, range 0 to 11
  7. Veh_owned: average 1.99, range 1 to 3
  8. risk_score: average 99.06, range 91.9 to 99.98
  9. no_of_premiums_paid: average 10.86, range 2 to 60
  10. premium: average 10924, range 1200 to 60000

EDA

Univariate Analysis

Data is right skewed. No outliers. 50% of customers pay less than 17% cash.

Boxplot indicates outliers. Older age. Normally distributed with mean of 51.

Right skewed.

Right skewed.

Right skewed.

Average 2 vehicles owned.

Average 2.5 dependents. No outliers.

Left skewed. Average 99%. Outliers in the lower range.

Right skewed. Average 10.86. Outliers in upper range.

Right skewed. Average 10924. Outliers in upper range.

Non-defaulted customers: 93.7%. Defaulted: 6.3%.

Owned: 50.1%. Rented: 49.9%.

Married: 49.9%. Unmarried: 50.1%.

Vehicles owned between 1 and 3. Equally distributed. Average about 2.

Number of dependents: 1 (24.8%), 2 (24.9%), 3 (25.3%), 4 (24.9%).

Times premium was paid 3-6 months late: 0 (83.8%), 1 (11.1%), 2 (3.2%), 3 (1.2%), 4 (0.5%), 5 (0.2%), 6 (0.1%).

Times premium was paid 6-12 months late: 0 (95.1%), 1 (3.4%), 2 (0.9%), 3 (0.4%), 4 (0.2%), 5 (0.1%).

Times premium was paid more than 12 months late: 0 (95.3%), 1 (3.8%), 2 (0.6%), 3 (0.2%), 4 (0.1%).

Bivariate Analysis

There are overlaps. No clear distinction in the distribution of variables between customers who defaulted and those who did not.

Median age of defaulters is lower than that of non-defaulters. Younger customers are more likely to default. Outliers in both plots.

Defaulters had a higher median percentage of premium paid in cash. Cash-paying customers are more likely to default. Most non-defaulters

Median risk score is higher for non-defaulters. Less chance for default with higher risk scores.

Outliers in upper range. Defaulters and non-defaulters very similar for number of premiums paid.

Defaulters have a higher number of dependents.

Fairly even.

The higher the count, the greater the chance of default. The default rate is 100% for counts of 10-13.

The higher the count, the greater the chance of default. The default rate is 100% for counts of 12 and 17.

The higher the count, the greater the chance of default. The default rate is 100% for a count of 11.
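The default rates per late-payment count quoted above can be computed with a groupby. This is a sketch on made-up rows; the helper column `defaulted` is a hypothetical name introduced here because the target is coded 0 = defaulted. The group size is shown next to the rate, since the univariate shares above show that high counts are rare and a 100% rate may rest on very few customers.

```python
import pandas as pd

# Made-up rows; `default` follows the data dictionary (0 = defaulted).
df = pd.DataFrame({
    "Count_3-6_months_late": [0, 0, 0, 0, 1, 1, 2, 2, 10],
    "default":               [1, 1, 1, 0, 1, 0, 0, 0, 0],
})

# Hypothetical helper column: 1 when the customer defaulted.
df["defaulted"] = (df["default"] == 0).astype(int)

# Default rate and number of customers for each late-payment count.
rate = df.groupby("Count_3-6_months_late")["defaulted"].agg(["mean", "size"])
print(rate)
```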

No real differences indicated between married and unmarried customers.

No real differences indicated with number of vehicles owned.

No real differences indicated in number of dependents.

No real differences indicated in accommodation.

The default rate is 30% at 2 premiums paid, then drops to around 10% until 37 premiums paid. The highest chances of default are at 38, 41, 43, 45, 47, 50, and 59 premiums paid.

Default percentage stays steady at around 6-10% throughout the observations.

The higher the percentage paid in cash, the higher the chance for default.

No real differences indicated.

Slightly higher chance of default C, D, and E.

C, D, and E appear to have higher chances of default.

Higher income less likely to default.

Model evaluation criterion:

Model can make wrong predictions as:

  1. Predicting a customer is going to default, and the customer does not - Loss of resources
  2. Predicting a client will not default and does - Loss of income

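Which case is more important?

The second case: a missed defaulter means lost premium income, while a false alarm only costs the resources spent on a follow-up.

How to reduce this loss: need to reduce False Negatives.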
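Reducing false negatives is the same as raising recall on the defaulter class. Because the data dictionary codes defaulters as 0, a sketch with scikit-learn has to name 0 as the positive label explicitly; the labels below are made up for illustration.

```python
from sklearn.metrics import confusion_matrix, recall_score

# Made-up labels: 0 = defaulted, 1 = not defaulted.
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1, 1, 1, 1, 0]

# Recall for defaulters: share of actual defaulters the model caught.
recall_default = recall_score(y_true, y_pred, pos_label=0)

# With labels=[1, 0], "not defaulted" is the negative class and
# "defaulted" the positive class, so ravel() yields tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[1, 0]).ravel()

print(recall_default)  # tp / (tp + fn)
print(fn)              # defaulters predicted as non-defaulters
```

Leaving `pos_label` at its default of 1 would instead measure recall on non-defaulters, which is not the error this project wants to minimize.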

Will apply Logistic Regression (with over-sampling, under-sampling, and regularization), Decision Tree, Random Forest, Bagging, AdaBoost, Gradient Boosting, and XGBoost. The best-performing model will be chosen based on recall, and the most important variables for predicting default will be identified.

Splitting into training and test sets

No missing values.

Encoded categorical values.
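A sketch of the encoding and splitting steps, under assumptions: one-hot encoding with `drop_first=True`, a 75/25 split, and stratification on the target are choices made for this example, since the notes do not state the encoding scheme or split ratio. The rows are made up.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up rows; category labels are hypothetical.
df = pd.DataFrame({
    "Income": list(range(20)),
    "sourcing_channel": list("ABCDE") * 4,
    "residence_area_type": ["Urban", "Rural"] * 10,
    "default": [0] * 4 + [1] * 16,  # 0 = defaulted
})

# One-hot encode the two categorical columns.
X = pd.get_dummies(
    df.drop(columns=["default"]),
    columns=["sourcing_channel", "residence_area_type"],
    drop_first=True,
)
y = df["default"]

# Stratify so the rare defaulter class keeps the same share in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=1
)

print(X_train.shape, X_test.shape)
```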

Logistic Regression

Poor CV score.

Recall scoring poorly.
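The cross-validated recall check for logistic regression might look like the sketch below. It runs on synthetic imbalanced data from `make_classification`, so the printed scores say nothing about the real dataset; the fold count and the scorer with `pos_label=0` (defaulters coded 0) are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data: roughly 6% of rows in class 0 ("defaulted").
X, y = make_classification(
    n_samples=1000, n_features=8, weights=[0.06, 0.94], random_state=1
)

# Score recall on class 0, the defaulter class.
scorer = make_scorer(recall_score, pos_label=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

scores = cross_val_score(
    LogisticRegression(max_iter=1000), X, y, scoring=scorer, cv=cv
)
print(scores.mean())
```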

Oversampling

Much better recall performance.

Regularization

Worse recall performance.
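The notes do not say which regularization was applied. As one possible illustration, scikit-learn's `LogisticRegression` uses an L2 penalty by default and controls its strength through `C` (smaller `C` means stronger regularization). The sketch below, on synthetic data, shows the coefficients shrinking as `C` decreases.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data, scaled so the penalty treats features alike.
X, y = make_classification(
    n_samples=500, n_features=8, weights=[0.1, 0.9], random_state=1
)
X = StandardScaler().fit_transform(X)

# Default penalty is L2; C is the inverse regularization strength.
weak = LogisticRegression(C=100.0, max_iter=1000).fit(X, y)
strong = LogisticRegression(C=0.01, max_iter=1000).fit(X, y)

weak_norm = np.linalg.norm(weak.coef_)
strong_norm = np.linalg.norm(strong.coef_)
print(weak_norm, strong_norm)
```

Shrinking the coefficients pulls predicted probabilities toward the majority class, which is one plausible reason recall on the rare class can get worse.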

Undersampling

Better recall performance.
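The oversampling and undersampling steps above can be sketched with plain random resampling from `sklearn.utils.resample`. This is a swapped-in technique for illustration; the notes do not name the resampler actually used. Resampling should be applied to the training data only.

```python
import numpy as np
from sklearn.utils import resample

# Made-up training data: 2 defaulters (class 0) and 18 non-defaulters.
X = np.arange(40).reshape(20, 2)
y = np.array([0] * 2 + [1] * 18)

X_min, X_maj = X[y == 0], X[y == 1]

# Oversampling: draw minority rows with replacement up to majority size.
X_min_up = resample(X_min, replace=True, n_samples=len(X_maj), random_state=1)
X_over = np.vstack([X_maj, X_min_up])
y_over = np.array([1] * len(X_maj) + [0] * len(X_min_up))

# Undersampling: draw majority rows without replacement down to minority size.
X_maj_down = resample(X_maj, replace=False, n_samples=len(X_min), random_state=1)
X_under = np.vstack([X_maj_down, X_min])
y_under = np.array([1] * len(X_maj_down) + [0] * len(X_min))

print(np.bincount(y_over), np.bincount(y_under))
```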

Decision Tree (DTREE), AdaBoost (ADB), Gradient Boosting (GBM), and XGBoost (XGB) all have the top scores, followed by Bagging. Random Forest (RF) has the lowest.
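A comparison like the one above can be produced by cross-validating each model with the same recall scorer. The sketch below uses synthetic data and small default-style settings, so its ranking will not match the notes. XGBoost is left out because it lives in a separate package; an `xgboost.XGBClassifier` could be added to the dictionary in the same way.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: about 10% of rows in class 0 ("defaulted").
X, y = make_classification(
    n_samples=600, n_features=8, weights=[0.1, 0.9], random_state=1
)

models = {
    "DTREE": DecisionTreeClassifier(random_state=1),
    "Bagging": BaggingClassifier(n_estimators=20, random_state=1),
    "RF": RandomForestClassifier(n_estimators=50, random_state=1),
    "ADB": AdaBoostClassifier(n_estimators=50, random_state=1),
    "GBM": GradientBoostingClassifier(n_estimators=50, random_state=1),
}

# Same scorer and folds for every model so the scores are comparable.
scorer = make_scorer(recall_score, pos_label=0)
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)

results = {
    name: cross_val_score(model, X, y, scoring=scorer, cv=cv).mean()
    for name, model in models.items()
}

# Print models from highest to lowest mean recall.
for name, score in sorted(results.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f}")
```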

Hyperparameter Tuning DTREE

Poor recall scoring on both training and test.

Poor recall scoring on training and test.
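The tuning step can be sketched with `GridSearchCV` scored on defaulter recall. The parameter grid, the grid-search strategy, and the synthetic data are all assumptions for illustration; the notes do not list the settings that were searched. The same pattern applies to the other models by swapping the estimator and grid.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: about 10% of rows in class 0 ("defaulted").
X, y = make_classification(
    n_samples=600, n_features=8, weights=[0.1, 0.9], random_state=1
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1
)

# Hypothetical grid; class_weight="balanced" upweights the rare class.
param_grid = {
    "max_depth": [2, 4, 6],
    "min_samples_leaf": [1, 5, 10],
    "class_weight": [None, "balanced"],
}

scorer = make_scorer(recall_score, pos_label=0)
search = GridSearchCV(
    DecisionTreeClassifier(random_state=1), param_grid, scoring=scorer, cv=3
)
search.fit(X_train, y_train)

# Compare recall on both sets to spot overfitting.
train_recall = recall_score(y_train, search.predict(X_train), pos_label=0)
test_recall = recall_score(y_test, search.predict(X_test), pos_label=0)
print(search.best_params_, train_recall, test_recall)
```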

Hyperparameter Tuning for Bagging

Much better recall scoring on training and test.

Good recall scoring on training and test.

Hyperparameter Tuning Random Forest

Good recall scoring on test set. Worse on training.

Better recall performance on test set and worse on training.

Hyperparameter Tuning Adaboost

Poor recall performance.

No change in poor recall performance.

Hyperparameter Tuning GradientBoosting

Poor recall performance on test and training sets.

Still poor performance on recall test and training sets.

Hyperparameter Tuning XGBoost

Better recall performance on test and training sets.

Top factors from three best performing models
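Top factors can be read from a fitted tree-based model's `feature_importances_`. The sketch below uses synthetic data with generic feature names, so the ranking it prints is not a finding about the insurance data; a random forest is used only as an example of a model that exposes importances.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with placeholder feature names.
feature_names = [f"feature_{i}" for i in range(6)]
X, y = make_classification(
    n_samples=500, n_features=6, n_informative=3, random_state=1
)

model = RandomForestClassifier(n_estimators=50, random_state=1).fit(X, y)

# Importances sum to 1; sort so the strongest factors come first.
importances = pd.Series(
    model.feature_importances_, index=feature_names
).sort_values(ascending=False)

print(importances.head(3))
```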

Business Recommendations

Predictive model built to:

  1. identify customers who will default on premiums.
  2. determine key factors that drive defaults.
  3. help company reduce premium defaults.